---
name: pytdc
description: Therapeutics Data Commons (PyTDC) for AI-ready therapeutic ML datasets and benchmarks; use it when you need standardized dataset loading, meaningful splits (e.g., scaffold/cold-start), and consistent evaluation for ADME/Toxicity/DTI/DDI or molecular optimization.
license: MIT
author: aipoch
---

> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

## When to Use

- You need **curated, AI-ready datasets** for drug discovery tasks (e.g., ADME, toxicity, bioactivity, DTI/DDI).
- You want **standardized benchmarks** with consistent evaluation protocols (including multi-seed benchmark groups).
- You require **meaningful data splits** such as **scaffold splits** (chemical diversity) or **cold-start splits** (unseen drugs/targets).
- You are building models for **single-instance prediction** (molecular/protein properties) or **multi-instance prediction** (interactions).
- You are doing **goal-directed molecular generation** and need **property Oracles** for scoring/optimization.

## Key Features

- **Dataset access by task type**
  - Single-instance prediction: ADME, Toxicity (Tox), HTS, QM, and more.
  - Multi-instance prediction: DTI, DDI, PPI, and more.
  - Generation: MolGen, RetroSyn, PairMolGen.
- **Standardized splitting utilities**
  - `random`, `scaffold`, and cold-start variants such as `cold_drug`, `cold_target`, `cold_drug_target`, plus `temporal` where applicable.
- **Unified evaluation**
  - Built-in `Evaluator` with common metrics (ROC-AUC, PR-AUC, RMSE, MAE, Spearman, etc.).
- **Benchmark groups**
  - Curated collections (e.g., the ADMET group) with recommended evaluation protocols (commonly 5 seeds); see the sketch after the Dependencies section.
- **Chem/data utilities**
  - Molecular format conversion (e.g., SMILES → PyG), filtering, balancing, negative sampling, entity retrieval (CID → SMILES, UniProt → sequence).
- **Molecular Oracles**
  - Property scoring functions usable for goal-directed generation workflows (see `references/oracles.md`).

## Dependencies

Install (recommended):

```bash
uv pip install PyTDC
```

Upgrade:

```bash
uv pip install PyTDC --upgrade
```

Installing `PyTDC` (latest from PyPI) automatically pulls in its core runtime dependencies; exact versions depend on the PyTDC release you install:

- `numpy`
- `pandas`
- `scikit-learn`
- `tqdm`
- `seaborn`
- `fuzzywuzzy`

Optional dependencies may be pulled in automatically depending on which submodules you use (e.g., graph backends or chemistry toolchains).
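The multi-seed benchmark groups mentioned under Key Features follow a fixed protocol: official train/valid/test data per benchmark, one training run per seed, and aggregated metrics across seeds. The sketch below outlines that protocol for the ADMET group. The names used (`admet_group`, `Caco2_Wang`, `get_train_valid_split`, `evaluate_many`) follow the public TDC documentation, but verify them against the PyTDC release you have installed, and substitute your own model where indicated.

```python
# Sketch of the multi-seed benchmark-group protocol (ADMET group).
# Names below (admet_group, Caco2_Wang, evaluate_many) follow the public
# TDC docs; double-check them against your installed PyTDC version.
from tdc.benchmark_group import admet_group

group = admet_group(path="data/")       # downloads/caches benchmark data
predictions_list = []

for seed in [1, 2, 3, 4, 5]:            # the commonly recommended 5-seed protocol
    benchmark = group.get("Caco2_Wang")
    name = benchmark["name"]
    train_val, test = benchmark["train_val"], benchmark["test"]
    train, valid = group.get_train_valid_split(benchmark=name, split_type="default", seed=seed)

    # Train your model on `train`/`valid` here and predict on `test["Drug"]`.
    # As a placeholder, predict the training-set mean for every test molecule.
    y_pred_test = [train["Y"].mean()] * len(test)

    predictions_list.append({name: y_pred_test})

# Aggregates per-seed results into mean/std of the benchmark's metric.
results = group.evaluate_many(predictions_list)
print(results)
```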
## Example Usage

A complete, runnable example that:

1. loads an ADME dataset,
2. performs a scaffold split,
3. trains a simple baseline model,
4. evaluates with a standard metric,
5. queries an Oracle score for a SMILES.

```python
# pip install PyTDC scikit-learn
from tdc.single_pred import ADME
from tdc import Evaluator, Oracle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge


def main():
    # 1) Load a single-instance prediction dataset (ADME)
    data = ADME(name="Caco2_Wang")

    # 2) Create a scaffold split (train/valid/test)
    split = data.get_split(method="scaffold", seed=42, frac=[0.7, 0.1, 0.2])
    train, valid, test = split["train"], split["valid"], split["test"]

    # 3) Train a simple baseline model on SMILES strings
    #    (character n-gram + ridge regression; replace with your own model)
    model = Pipeline(
        steps=[
            ("featurizer", CountVectorizer(analyzer="char", ngram_range=(2, 5))),
            ("regressor", Ridge(alpha=1.0)),
        ]
    )
    model.fit(train["Drug"], train["Y"])

    # 4) Evaluate on the test set using a TDC Evaluator
    y_pred = model.predict(test["Drug"])
    evaluator = Evaluator(name="MAE")
    mae = evaluator(test["Y"], y_pred)
    print(f"Test MAE: {mae:.4f}")

    # 5) Oracle scoring example (property scoring for a SMILES)
    oracle = Oracle(name="DRD2")
    score = oracle("CC(C)Cc1ccc(cc1)C(C)C(O)=O")
    print(f"DRD2 Oracle score: {score}")


if __name__ == "__main__":
    main()
```

Related references and templates (if present in this skill package):

- Oracle catalog and usage: `references/oracles.md`
- Utility functions (splits, processing, retrieval): `references/utilities.md`
- Dataset catalog: `references/datasets.md`
- Workflow templates: `scripts/load_and_split_data.py`, `scripts/benchmark_evaluation.py`, `scripts/molecular_generation.py`

## Implementation Details

### 1) Dataset Access Pattern

PyTDC datasets follow a consistent interface:

```python
from tdc.<problem_type> import <DataClass>

data = <DataClass>(name="<Dataset Name>")
df = data.get_data(format="df")
split = data.get_split(method="scaffold", seed=1, frac=[0.7, 0.1, 0.2])
```

- `<problem_type>` is typically one of:
  - `single_pred` (single-entity property prediction)
  - `multi_pred` (pairwise/multi-entity interaction prediction)
  - `generation` (molecule/reaction generation tasks)

### 2) Splitting Strategies (Key Parameters)

Use `get_split(...)` to obtain `{"train": ..., "valid": ..., "test": ...}`.

Common parameters:

- `method`: split strategy
- `seed`: random seed for reproducibility
- `frac`: `[train, valid, test]` fractions (when supported)

Typical methods:

- `random`: random shuffling split
- `scaffold`: Bemis–Murcko scaffold-based split to reduce scaffold leakage and improve chemical generalization
- Cold-start (commonly for interaction tasks like DTI/DDI):
  - `cold_drug`: test contains unseen drugs
  - `cold_target`: test contains unseen targets
  - `cold_drug_target`: test contains unseen drugs and targets
- `temporal`: time-based split for datasets with timestamps (when available)

Example:

```python
split = data.get_split(method="cold_target", seed=1)
```

### 3) Standardized Evaluation

TDC provides a unified evaluator:

```python
from tdc import Evaluator

evaluator = Evaluator(name="ROC-AUC")  # classification
score = evaluator(y_true, y_pred)
```

Choose metrics appropriate to the task type:

- Classification: `ROC-AUC`, `PR-AUC`, `F1`, `Accuracy`, etc.
- Regression: `RMSE`, `MAE`, `R2`, `Spearman`, `Pearson`, etc.
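As a hedged illustration of how the split methods from Section 2 and the `Evaluator` from Section 3 fit together for an interaction task, the sketch below loads a DTI dataset, requests a cold-target split, checks that test-set targets are unseen, and scores a trivial train-mean baseline. The dataset name (`DAVIS`) and the exact cold-start method string are assumptions; some PyTDC releases instead expose a generic `cold_split` method with a `column_name` argument, so verify against your installed version.

```python
# Sketch: cold-target split on a DTI dataset plus a baseline evaluation.
# Assumptions to verify against your PyTDC version: the dataset name "DAVIS"
# and the split method string ("cold_target" here; some releases use
# method="cold_split" with column_name="Target" instead).
from tdc.multi_pred import DTI
from tdc import Evaluator

data = DTI(name="DAVIS")
split = data.get_split(method="cold_target", seed=1)
train, test = split["train"], split["test"]

# Cold-target split: no protein target in the test set appears in training.
overlap = set(train["Target"]) & set(test["Target"])
print(f"Shared targets between train and test: {len(overlap)}")  # expected: 0

# Trivial baseline: predict the training-set mean affinity for every pair.
y_pred = [train["Y"].mean()] * len(test)

evaluator = Evaluator(name="MAE")
print(f"Train-mean baseline MAE: {evaluator(test['Y'], y_pred):.4f}")
```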
### 4) Data Schemas (What Columns to Expect)

While schemas vary by task, common conventions include:

- Single-instance prediction (e.g., ADME/Tox):
  - `Drug` (often SMILES) and label `Y`
  - sometimes `Drug_ID` / `Compound_ID`
- Multi-instance prediction (e.g., DTI):
  - `Drug` (SMILES), `Target` (protein sequence), label `Y`
  - plus identifiers such as `Drug_ID`, `Target_ID`

### 5) Oracles for Molecular Optimization

Oracles provide a callable scoring interface:

```python
from tdc import Oracle

oracle = Oracle(name="GSK3B")
score = oracle("CCO...")
scores = oracle(["SMILES1", "SMILES2"])
```

Use Oracles to:

- score candidate molecules during generation,
- define optimization objectives,
- compare molecules under consistent property predictors.

For the full list of Oracles and their expected inputs/outputs, see `references/oracles.md`.
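To make the bullet points above concrete, here is a small ranking sketch that scores a handful of candidate SMILES with two Oracles and sorts them by a combined objective. The Oracle names (`QED`, `GSK3B`) and the product-of-scores objective are illustrative choices rather than a prescribed TDC workflow; consult `references/oracles.md` for the catalog your package version actually ships.

```python
# Illustrative ranking of candidate molecules with two TDC Oracles.
# Oracle names (QED, GSK3B) and the combined objective are assumptions for
# demonstration; pick oracles suited to your own optimization goal.
from tdc import Oracle

candidates = [
    "CC(=O)Oc1ccccc1C(=O)O",              # aspirin
    "CC(C)Cc1ccc(cc1)C(C)C(O)=O",         # ibuprofen
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",       # caffeine
]

qed = Oracle(name="QED")      # drug-likeness, roughly in [0, 1]
gsk3b = Oracle(name="GSK3B")  # predicted GSK3-beta activity, roughly in [0, 1]

qed_scores = qed(candidates)      # passing a list returns one score per SMILES
gsk3b_scores = gsk3b(candidates)

# Combined objective: reward molecules that are both drug-like and predicted active.
ranked = sorted(
    zip(candidates, qed_scores, gsk3b_scores),
    key=lambda row: row[1] * row[2],
    reverse=True,
)
for smiles, q, g in ranked:
    print(f"{smiles}\tQED={q:.3f}\tGSK3B={g:.3f}")
```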